Forecasting the Spanish Electricity Power Generation Load and Cost per MWh Using ARIMA and Deep Learning (TensorFlow LSTM)

Analysis and Model Development by Tamer Hanna

Abstract

The objective of this project is to perform time series analysis and modeling to understand the power generation market of Spain, predict the hourly load, and use that load to predict the price in euros per megawatt-hour.

Several open datasets from different websites are used in this analysis.

1- The power generation dataset, which includes all the different sources of generation (hourly records from 31 Dec. 2014 to 31 Dec. 2018; 1,016,305 records of data)

2- Weather conditions in Spain (temperature, humidity, pressure, rain, clouds) for 5 large cities (hourly weather records from 31 Dec. 2014 to 31 Dec. 2018; 3,032,732 records for ['Valencia' 'Madrid' 'Bilbao' ' Barcelona' 'Seville'])

3- Monthly average global price of coal in USD for the period between 2015 and 2020 (49 records)

4- History of Henry Hub natural gas prices in USD for the last 10 years (3290 records)

5- West Texas Intermediate (WTI or NYMEX) crude oil prices per barrel from 2008 to 2020 in USD (3268 records)

6- Euro Dollar Daily Exchange Rate (EUR USD) - for the period starting from 1999 to 2020 (5825 records)

Dataset links

https://www.kaggle.com/nicholasjhana/energy-consumption-generation-prices-and-weather

https://www.macrotrends.net/1369/crude-oil-price-history-chart

https://www.macrotrends.net/2478/natural-gas-prices-historical-chart

https://www.macrotrends.net/2548/euro-dollar-exchange-rate-historical-chart

https://fred.stlouisfed.org/series/PCOALAUUSDM#0

Introduction

In this notebook I explore the power generation sector in Spain and what affects the price per MWh. The models are built to allow private power generation plants to bid on generation by forecasting the hourly load and the cost. We try to predict the hourly load from its hourly history using an ARIMA model, and the cost per MWh in euros using a TensorFlow LSTM deep learning model.

The electricity generation price depends on several factors that affect the costs built into it. These costs are divided into fixed and variable costs, which are based on several inputs such as:

1- The load, or the required electric consumption, which is affected by several factors such as the day of the week and the hour, weather conditions, and population density.

2- The generation facility's running costs in terms of operations and maintenance, and the initial capital invested in it.

3- The fuel cost, which changes from day to day.

4- The transmission cost.

The datasets we are using should help in forecasting the price range and the load with all the variables and history of data provided. The energy dataset is composed of the energy demand, the generation load from different sources, the price per MWh, and the forecasted price. The dataset is unique because it contains hourly data for electrical consumption and the respective forecasts for consumption and pricing. This allows prospective forecasts, built with the various approaches to implementing a time series application, to be benchmarked against the current state-of-the-art forecasts used in industry. Adding the market price changes of the fuels used in generation and the USD to EUR exchange rate may add more forecasting power to the price models.

Ethical ML Framework

Data Governance:

As the goal of this report is only to research time series methods, many aspects of the ethical ML framework do not directly apply. The datasets used in this report are publicly accessible on websites for the study and analysis of machine learning techniques, not for business use, and, as stated by the publishers, do not require any approval or request for this purpose:

1- ENTSO-E (European Network of Transmission System Operators for Electricity), a public portal for Transmission System Operator (TSO) data. Settlement prices were obtained from the Spanish TSO, Red Eléctrica de España.

2- Weather data is available on Kaggle from its owner and is open to public study use.

3- Fuel prices and exchange rates are available and open for public use from Macrotrends and FRED Economic Data.

This data is used as provided on the websites, without any certificate or confirmation of its accuracy. It has not been tested or verified by any test or expert, nor compared against any other data source or calibrated measurement. However, since this data is used to explore machine learning tools and techniques and not to drive any decision related to energy usage or cost analysis, these verifications are not required for the purpose of this report.

Accuracy/Trust of the model:

The output models and analysis of this report are not intended to be put into use or to give any advice related to the energy field, energy usage, environmental effects, or socioeconomic studies. The data and the analysis serve mainly to explore data science, and some assumptions and interpolations were made that may affect the results of the analysis.

Social Impact

Since this report is focused on exploring machine learning tools and techniques, the inputs, outputs, and results are not verified and could have a negative effect on, or introduce biases against, the community. It should also not be used in any decisions that could have a social impact, or in business operations and decisions related to sustainability in the energy sector.

If the same techniques are applied in a real-life scenario, the machine learning techniques, data inputs, and results should be assessed against the ML framework for social impact, accuracy/trust, and governance.

Data Cleaning

Energy Generation Dataset Cleaning

The above two steps produced two important findings:

1- The following columns have 0 values in all cells: 'generation fossil coal-derived gas', 'generation fossil oil shale', 'generation fossil peat', 'generation geothermal', 'generation wind offshore'.

2- 'generation hydro pumped storage aggregated' and 'forecast wind offshore eday ahead' contain only null values, so we will drop these 2 columns. We will look at the rest of the null values later.

We will have to drop all these columns from the dataset
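The drop rule can be sketched as follows. This is a minimal, dependency-free illustration and not the notebook's own code: the table layout (a dict of column name to list of values, with None for missing cells), the function name `drop_empty_columns`, and the sample numbers are all assumptions.

```python
def drop_empty_columns(table):
    """Return a copy of `table` without columns that are entirely null
    or whose non-null values are all zero.

    `table` maps a column name to a list of values; None marks a missing cell.
    """
    kept = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        all_null = len(present) == 0
        all_zero = len(present) > 0 and all(v == 0 for v in present)
        if not (all_null or all_zero):
            kept[name] = list(values)
    return kept


# Made-up sample values; only the column names come from the text above.
energy = {
    "generation nuclear": [7096.0, 7099.0, None, 7098.0],
    "generation fossil peat": [0.0, 0.0, 0.0, 0.0],
    "generation hydro pumped storage aggregated": [None, None, None, None],
}
cleaned = drop_empty_columns(energy)
print(list(cleaned))
```

Columns that still contain a few nulls next to real readings, such as the nuclear column here, are kept so that their gaps can be handled later.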

We will zoom in on a period that is missing some values.

From the above curve and table:

1- It appears that the null values are common across all the generation plant types and the total, which means we cannot calculate the missing values from the other columns and will have to fill them using linear interpolation.

2- As we know from power generation concepts, nuclear generation needs to stay stable for long periods, and fluctuations in the load are managed through the other types of generation.
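Linear interpolation, as mentioned in the first point, fills each gap along a straight line between the nearest known values on either side. The sketch below shows the idea in plain Python; the function name and the sample series are assumptions, and the notebook itself may well use a library routine instead.

```python
def interpolate_linear(values):
    """Fill interior runs of None by linear interpolation.

    Leading and trailing gaps have no neighbour on one side and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap > 1:
            step = (filled[right] - filled[left]) / gap
            for k in range(1, gap):
                filled[left + k] = filled[left] + step * k
    return filled


# Made-up hourly load values with a two-hour gap and a trailing gap.
load = [100.0, None, None, 130.0, 128.0, None]
print(interpolate_linear(load))  # [100.0, 110.0, 120.0, 130.0, 128.0, None]
```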

As we can see from the above graph, the distribution of generation across resources varies, but some resources are dominant and stable, such as the nuclear plants, because varying the output is not easy during the operation of such plants.

We can see a lot of fluctuation among the fossil fuels, mainly the coal and gas plants, while clean energy resources and hydro fluctuate depending on the weather and the season of the year.

It also seems that generation from waste does not make up a large percentage of the Spanish grid's generation.

The histograms show, for each source of power generation, its limits and its normal operating range.

Weather dataset cleaning

First, we checked the values within the dataset for outliers, as the weather has a direct effect on generation for the following reasons:

1- Temperature and humidity affect demand through air conditioning and heating.

2- Temperature and humidity affect thermal generation, as they affect the cooling process for the equipment and the efficiency of the plants.

3- Wind speed affects wind generation.

4- Rain and clouds have a direct effect on generation through solar farms.

So first we will look at the max and min of all the values and see whether they make sense.

1- Temperature: the max is 321 K, equivalent to 47.85 °C, which might occur on rare, very hot summer days.

2- Temperature: the min is 262.2 K, equivalent to -10.95 °C, which still makes sense for the coldest winter days in northern Spain.

3- Pressure: there seem to be problems with the max and min pressure in hPa, which include abnormal values such as 0 and 1008371.0 that cannot occur unless there is a measurement problem. To my knowledge of the power generation domain, pressure has a minor to negligible effect, so we will drop it from the data.

4- Humidity: contains 0 values, which are highly unlikely, so we will cap the humidity at a max of 90% and a min of 35%.

5- Wind speed: there are some strange values reaching 133 m/s, which is not physically plausible. The max wind speed recorded in Madrid over the last 10 years was 7.8 m/s, so we will need to clean this data.

6- We will drop weather_id, weather_main, weather_description, and weather_icon, as the details in the other columns are enough to indicate all weather conditions.
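The checks in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's code: the row layout, the key names `temp`, `pressure`, `humidity`, and `wind_speed`, and especially the wind threshold `MAX_PLAUSIBLE_WIND` are hypothetical, since the text does not say how the wind outliers were cleaned.

```python
MAX_PLAUSIBLE_WIND = 50.0  # m/s; hypothetical threshold, not taken from the notebook


def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return kelvin - 273.15


def clip(value, low, high):
    """Limit `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


def clean_weather_row(row):
    """Apply the cleaning rules to one hourly weather record (a dict)."""
    cleaned = dict(row)
    cleaned.pop("pressure", None)  # pressure is dropped entirely
    cleaned["humidity"] = clip(row["humidity"], 35, 90)  # cap to 35%..90%
    if row["wind_speed"] > MAX_PLAUSIBLE_WIND:
        cleaned["wind_speed"] = None  # mark as missing, to be filled later
    return cleaned


print(round(kelvin_to_celsius(321.0), 2))  # 47.85
print(round(kelvin_to_celsius(262.2), 2))  # -10.95
print(clean_weather_row(
    {"temp": 321.0, "pressure": 1008371.0, "humidity": 0, "wind_speed": 133}
))
```

Marking implausible wind readings as missing rather than clipping them is only one option; it keeps the cleaned series from piling up at an artificial ceiling.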

Based on the above findings, the number of rows is not equal for all cities, so we will have to dig deeper to find out whether the dates cover the same range and whether there are duplicates.

Now the 5 weather dataframes are ready to be merged with the energy dataframe.
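One way to compare the cities is to summarize each city's timestamps by row count, date range, and number of duplicated hours. The sketch below assumes each city's timestamps are available as a list of `datetime` objects; the function name and the sample data are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def summarize_city(timestamps):
    """Return row count, first and last timestamp, and number of duplicate rows."""
    counts = Counter(timestamps)
    duplicates = sum(n - 1 for n in counts.values())
    return {
        "rows": len(timestamps),
        "start": min(timestamps),
        "end": max(timestamps),
        "duplicates": duplicates,
    }


# Made-up sample: the second city repeats one hour.
madrid = [datetime(2015, 1, 1, h) for h in range(3)]
bilbao = [datetime(2015, 1, 1, 0), datetime(2015, 1, 1, 1),
          datetime(2015, 1, 1, 1), datetime(2015, 1, 1, 2)]

for city, stamps in {"Madrid": madrid, "Bilbao": bilbao}.items():
    print(city, summarize_city(stamps))
```

If two cities share the same start and end but differ in row count, the difference in `duplicates` shows whether repeated hours explain it.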

Oil Price

Natural Gas Prices

Coal prices

Since the coal dataset is a monthly average, we will have to fill the hourly values with the same monthly average.
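Filling hourly values from a monthly series amounts to looking up each hour's (year, month) in the monthly table. A minimal sketch, assuming the monthly prices are held in a dict keyed by (year, month); the function name and the prices are made up.

```python
from datetime import datetime, timedelta


def monthly_to_hourly(monthly_prices, start, end):
    """Map every hour in [start, end) to the average price of its month.

    `monthly_prices` maps (year, month) to a price.
    """
    hourly = {}
    t = start
    while t < end:
        hourly[t] = monthly_prices[(t.year, t.month)]
        t += timedelta(hours=1)
    return hourly


coal = {(2015, 1): 60.0, (2015, 2): 62.5}  # made-up monthly averages in USD
hourly_coal = monthly_to_hourly(
    coal, datetime(2015, 1, 31, 22), datetime(2015, 2, 1, 2)
)
for ts, price in hourly_coal.items():
    print(ts, price)
```

The price steps from 60.0 to 62.5 at midnight on 1 February, so every hour within a month carries the same value.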

Exchange rate of USD and EURO

Combining the Datasets

We will select the range of each dataset that matches the energy dataset.

Prediction using Univariate ARIMA model for the load

From the above graphs, there seems to be no increasing or decreasing trend in the load, and the data appear to be stationary.

The ACF and PACF for the actual load indicate that there is high autocorrelation between the values, in an oscillating manner.

As the p-value of the stationarity test is 0, the dataset is stationary and there is no need to apply differencing to it.

The lowest AIC is 568623.216, obtained with orders (p, d, q) = (1, 1, 1) and seasonal orders (P, D, Q, s) = (1, 1, 1, 12).
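The selection by lowest AIC can be sketched as follows. The grid bounds (each order in 0..1), the helper names, and the log-likelihood values are assumptions made for illustration; the sketch does not fit any model and does not reproduce the notebook's search. It only shows how a candidate grid is enumerated and how AIC = 2k - 2 ln(L) ranks fitted candidates.

```python
from itertools import product


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def candidate_orders(max_order=1, seasonal_period=12):
    """Enumerate ((p, d, q), (P, D, Q, s)) pairs with each order in 0..max_order."""
    pdq = list(product(range(max_order + 1), repeat=3))
    return [(order, seasonal + (seasonal_period,))
            for order in pdq for seasonal in pdq]


candidates = candidate_orders()
print(len(candidates))  # 64

# Made-up log-likelihoods and parameter counts for two candidates,
# only to show the selection step.
fits = {
    ((1, 1, 1), (1, 1, 1, 12)): {"loglik": -100.0, "k": 5},
    ((0, 1, 1), (0, 1, 1, 12)): {"loglik": -104.0, "k": 3},
}
scores = {spec: aic(f["loglik"], f["k"]) for spec, f in fits.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Because AIC penalizes each extra parameter by 2, the larger model wins here only because its log-likelihood gain outweighs the penalty.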

ARIMA Model Fitting

So how to interpret the plot diagnostics?

Top left: The residual errors seem to fluctuate around a mean of zero and have a uniform variance.

Top right: The density plot suggests a normal distribution with mean zero.

Bottom left: The middle values of the sample are close to what we would expect from normally distributed data, as they follow the straight line in the diagram closely. However, the underlying data distribution seems to present extreme values more often than a normal one, which is why the points fall under the line for large negative values and over it for large positive ones.

Bottom right: The correlogram, also known as the ACF plot, shows that the residual errors are not autocorrelated. Any autocorrelation would imply that there is some pattern in the residual errors that is not explained by the model, in which case you would need to look for more X’s (predictors) for the model.

Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.

Prediction

Forecasting the Cost of the MWh with TensorFlow LSTM

In the following section we will explore the full dataset and the target variable, which is the cost per MWh, through the following steps:

1- Finding the correlations to the target variable

2- Understanding the features and selecting what will be included in the model

Since the p-value is 0, the dataset is stationary.

Feature Selection for Multivariate Forecasting

The first step is building the correlation matrix.
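Each entry of a correlation matrix is a Pearson correlation between two columns. The sketch below computes the one column of that matrix that matters most here, the correlation of every feature with the target. The column names and values are made up, and the notebook presumably relies on a library routine rather than this hand-written version.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_with_target(columns, target):
    """Correlate every column except `target` with the target column."""
    return {name: pearson(values, columns[target])
            for name, values in columns.items() if name != target}


# Made-up values: price rises with load and falls with wind generation.
columns = {
    "price": [50.0, 55.0, 60.0, 65.0],
    "load": [100.0, 110.0, 120.0, 130.0],
    "wind": [9.0, 7.0, 5.0, 3.0],
}
corr = correlation_with_target(columns, "price")
print(corr)
```

Pearson correlation only captures linear association, which is one reason to cross-check it against a tree-based importance ranking.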

Feature Selection

We will use XGBoost to understand feature importance.

Feature Importance in Gradient Boosting

A benefit of using gradient boosting is that after the boosted trees are constructed, it is relatively straightforward to retrieve importance scores for each attribute.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points or another more specific error function.

The feature importances are then averaged across all of the decision trees within the model.

By comparing the correlation matrix and the results of the XGBoost model, we selected the features that we initially expected to affect the forecast and ignored several others. For example, the exchange rate ranked low in both the importance ranking and the correlation matrix.

We can see that hard coal generation and the coal price have higher importance and a higher score in the correlation matrix, as 20% of the power generation in Spain comes from coal.

Checking the RMSE for baseline forecast provided with the dataset

We will compare the price forecast to the actual values, both of which are provided in the dataset, and use this as a baseline for the model we are building.
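The baseline comparison reduces to a root mean squared error between two aligned series. A minimal sketch follows; the variable names and the four sample prices are made up and are not values from the dataset.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Made-up hourly prices in EUR/MWh: errors are 3, -4, 0, 0.
price_actual = [65.0, 64.0, 60.0, 59.0]
price_forecast = [62.0, 68.0, 60.0, 59.0]
print(rmse(price_actual, price_forecast))  # 2.5
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, which suits a price forecast where large errors are costly.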

Multivariate forecasts using TensorFlow Keras LSTM

The learning rate that we will use for Adam is equal to 5e-3.
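Before training, a multivariate series has to be cut into fixed-length windows, because Keras LSTM layers expect input of shape (samples, timesteps, features). The sketch below shows that windowing step in plain Python. The window length of 3, the function name, and the toy data are assumptions; the notebook's actual window size and preprocessing are not stated in the text.

```python
def make_windows(rows, target_index, n_steps):
    """Build supervised samples from a multivariate hourly series.

    `rows` is a list of per-hour feature lists. Each sample X[i] holds
    `n_steps` consecutive rows, and y[i] is the target value of the hour
    immediately after that window.
    """
    X, y = [], []
    for start in range(len(rows) - n_steps):
        X.append(rows[start:start + n_steps])
        y.append(rows[start + n_steps][target_index])
    return X, y


# Toy series: 6 hours, 2 features per hour; feature 1 plays the role of price.
rows = [[float(h), float(10 * h)] for h in range(6)]
X, y = make_windows(rows, target_index=1, n_steps=3)
print(len(X), len(X[0]), len(X[0][0]))  # 3 3 2
print(y)  # [30.0, 40.0, 50.0]
```

Converting `X` and `y` to arrays then gives the (samples, timesteps, features) input and the one-step-ahead targets for the network.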

Conclusion

The model achieves an RMSE of 2.257, which is far better than the RMSE of 12.334 for the price forecast provided by the TSO. From the analysis, we will need to add more time-related features, such as weekday or weekend and the hour of the week, as these have a strong effect on generation.
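The time-related features suggested above can be derived directly from each timestamp. A minimal sketch, with a hypothetical function name and feature names:

```python
from datetime import datetime


def time_features(ts):
    """Derive calendar features from one hourly timestamp."""
    weekday = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "hour": ts.hour,
        "weekday": weekday,
        "is_weekend": weekday >= 5,
        "hour_of_week": weekday * 24 + ts.hour,  # 0 (Mon 00:00) .. 167 (Sun 23:00)
    }


# 29 Dec. 2018 was a Saturday.
print(time_features(datetime(2018, 12, 29, 18)))
```

The `hour_of_week` value combines the daily and weekly cycles into a single index, so a model can distinguish, for example, a Saturday evening from a Monday evening.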

Many variables with high predictive power were expected and made full sense, as they have a direct effect on either the load requirements or the cost of generation.

I was expecting more of the temperature variables to have predictive power. They did not, and this may be because the temperature difference in Spain between summer and winter is not very large.

I was expecting to see very clear seasonality in the load, but it was not very clear. This may also be because the difference in temperatures between winter and summer is not as large as in Canada, for example.

App Deployment

An app that uses the model will be developed using Dash to help predict the load and the cost per MWh, taking into consideration the weather conditions and the fossil fuel prices of the day.